The cellular tree classifier model addresses a fundamental problem in
the design of classifiers for a parallel or distributed computing world: 
Given a data set, is it sufficient to apply a majority rule for classifica- 
tion, or shall one split the data into two or more parts and send each 
part to a potentially different computer (or cell) for further process- 
ing? At first sight, it seems impossible to define with this paradigm 
a consistent classifier as no cell knows the "original data size", n. 
However, we show that this is not so by exhibiting two different con- 
sistent classifiers. The consistency is universal but is only shown for 
distributions with nonatomic marginals. 
cellular computation, Bayes risk consistency, asymptotic analysis,
nonparametric estimation.
1 Introduction



We explore in this paper a new way of dealing with the supervised classifi- 
cation problem. In the model we have in mind, a basic computational unit 
in classification, a cell, takes as input training data, and makes a decision 
whether a majority rule should be applied to all data, or whether the data 
should be split, and each part of the partition should be given to another cell. 
All cells must be the same — their function is not altered by external inputs. 
In other words, the decision to split depends only upon the data presented to
the cell. Classifiers designed according to this autonomous principle will be 
called cellular tree classifiers, or simply cellular classifiers. This manner of 
tackling the classification problem is novel, but has a wide reach in a world in 
which parallel and distributed computation are important. In the short term,
parallelism will take hold in massive data sets and complex systems and, as
such, raises one of the exciting questions facing the statistics and
machine learning fields.

The purpose of the present document is to formalize the setting and to pro- 
vide a foundational discussion of various properties, good and bad, of tree 
classifiers that are formulated following these principles. Our constructions 
lead to classifiers that always converge. They are the first consistent cel- 
lular classifiers that we are aware of. This article is also motivated by the 
challenges involved in "big data" issues (see, e.g., Jordan [23]), in which re- 
cursive approaches such as divide-and-conquer algorithms (e.g., Cormen et 
al. [7]) play a central role. Such procedures are naturally adapted for execu- 
tion in multi-processor machines, especially shared-memory systems where 
the communication of data between processors does not need to be planned 
in advance. 

In the design of classifiers, we have an unknown distribution of a random
prototype pair $(\mathbf{X}, Y)$, where $\mathbf{X}$ takes values in
$\mathbb{R}^d$ and $Y$ takes only finitely many values, say 0 or 1 for
simplicity. Classical pattern recognition deals with predicting the unknown
nature $Y$ of the observation $\mathbf{X}$ via a measurable classifier
$g : \mathbb{R}^d \to \{0, 1\}$. Since it is not assumed that $\mathbf{X}$
fully determines the label, it is certainly possible to misspecify its
associated class. Thus, we err if $g(\mathbf{X})$ differs from $Y$, and the
probability of error for a particular decision rule $g$ is
$L(g) = \mathbb{P}\{g(\mathbf{X}) \neq Y\}$. The Bayes classifier

$$g^*(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbb{P}\{Y = 1 \mid \mathbf{X} = \mathbf{x}\} > \mathbb{P}\{Y = 0 \mid \mathbf{X} = \mathbf{x}\} \\ 0 & \text{otherwise} \end{cases}$$

has the smallest probability of error, that is

$$L^* = L(g^*) = \inf_{g : \mathbb{R}^d \to \{0,1\}} \mathbb{P}\{g(\mathbf{X}) \neq Y\}.$$

(See, for instance, Devroye, Györfi, and Lugosi [10, Theorem 2.1].) However,
most of the time, the distribution of $(\mathbf{X}, Y)$ is unknown, so that
$g^*$ is unknown too. Fortunately, it is often possible to collect a sample
(the data) $\mathcal{D}_n = ((\mathbf{X}_1, Y_1), \dots, (\mathbf{X}_n, Y_n))$
of independent and identically distributed (i.i.d.) copies of
$(\mathbf{X}, Y)$. We assume that $\mathcal{D}_n$ and $(\mathbf{X}, Y)$ are
independent. In this context, a classifier $g_n(\mathbf{x}; \mathcal{D}_n)$ is
a measurable function of $\mathbf{x}$ and $\mathcal{D}_n$, and it attempts to
estimate $Y$ from $\mathbf{X}$ and $\mathcal{D}_n$. For simplicity, we suppress
$\mathcal{D}_n$ in the notation and write $g_n(\mathbf{x})$ instead of
$g_n(\mathbf{x}; \mathcal{D}_n)$.

The probability of error of a given classifier $g_n$ is the random variable

$$L(g_n) = \mathbb{P}\{g_n(\mathbf{X}) \neq Y \mid \mathcal{D}_n\},$$

and the rule is consistent if

$$\lim_{n \to \infty} \mathbb{E} L(g_n) = L^*.$$

It is universally consistent if it is consistent for all possible distributions of 
(X, Y). Many popular classifiers are universally consistent. These include 
several brands of histogram rules, k-nearest neighbor rules, kernel rules,
neural networks, and tree classifiers. There are too many references to be
cited here, but the monographs by Devroye, Györfi, and Lugosi [10] and Györfi
et al. [21] will provide the reader with a comprehensive introduction to the 
domain and a literature review. Among these rules, tree methods loom large 
for several reasons. All procedures that partition space, such as histogram 
rules, can be viewed as special cases of partitions generated by trees. Simple 
neural networks that use voting methods can also be regarded as trees, and 
similarly, kernel methods with kernels that are indicator functions of sets are 
but special cases of tree methods. Tree classifiers are conceptually simple, 
and explain the data very well. However, their design can be cumbersome, as 
optimizations performed over all possible tree classifiers that follow certain 
restrictions could face a huge combinatorial and computational hurdle. The 
cellular paradigm addresses these concerns. 

Remark 1.1 Partitions of $\mathbb{R}^d$ based upon trees have been studied in
the computational geometry literature (Bentley [4]; Overmars; Edelsbrunner
and van Leeuwen; Mehlhorn [21]) and the computer graphics literature
(Samet [35, 36]). Most popular among these are the k-d trees and quadtrees.
Our version of space partitioning corresponds to Bentley's k-d trees
(Bentley [4]). The basic notions of trees as related to pattern recognition
can be found in Chapter 20 of Devroye, Györfi, and Lugosi [10]. However, trees
have been suggested as tools for classification more than twenty years before
that. We mention in particular the early work of K.S. Fu (You and Fu; Anderson
and Fu; Mui and Fu [29]; Lin and Fu [25]; Qing-Yun and Fu). Other references
from the 1970s include Meisel and Michalopoulos [28], Bartolucci, Swain, and
Wu [3], Payne and Meisel [32], Sethi and Chatterjee [37], Swain and Hauska,
Gordon and Olshen, and Friedman. Most influential in the classification tree
literature was the CART proposal by Breiman et al. [5]. While CART proposes
partitions by hyperrectangles, linear hyperplanes in general position have
also gained in popularity; the early work on that topic is by Loh and
Vanichsetakul, and Park and Sklansky [31]. Additional references on tree
classification include Gustafson, Gelfand, and Mitter; Argentiero, Chin, and
Beaudet; Hartmann et al.; Kurzynski; Wang and Suen; Suen and Wang; Shlien;
Chou; Gelfand and Delp; Gelfand, Ravishankar, and Delp; Simon; and Guo and
Gelfand.



2 Cellular tree classifiers 

2.1 The cellular computation spirit 

In general, classification trees partition $\mathbb{R}^d$ into regions, often
hyperrectangles parallel to the axes (an example is depicted in Figure 1). In
$t$-ary trees, each node has exactly $t$ or 0 children. If a node $u$
represents the set $A$ and its children $u_1, \dots, u_t$ represent
$A_1, \dots, A_t$, then it is required that $A = A_1 \cup \dots \cup A_t$ and
$A_i \cap A_j = \emptyset$ for $i \neq j$. The root of the tree represents
$\mathbb{R}^d$, and the terminal nodes (or leaves), taken together, form a
partition of $\mathbb{R}^d$. If a leaf represents region $A$, then the tree
classifier takes the simple form

$$g_n(\mathbf{x}) = \begin{cases} 1 & \text{if } \sum_{i=1}^n \mathbf{1}_{[\mathbf{X}_i \in A, Y_i = 1]} > \sum_{i=1}^n \mathbf{1}_{[\mathbf{X}_i \in A, Y_i = 0]},\ \mathbf{x} \in A \\ 0 & \text{otherwise.} \end{cases}$$

That is, in every leaf region, a majority vote is taken over all
$(\mathbf{X}_i, Y_i)$'s with $\mathbf{X}_i$'s in the same region. Ties are
broken, by convention, in favor of class 0.
The tree structure is usually data-dependent, as well, and indeed, it is in
the construction itself where different trees differ. Thus, there are virtually 
infinitely many possible strategies to build classification trees. Nevertheless, 
despite this great diversity, all tree species end up with two fundamental 
questions at each node: 



① Should the node be split?

② In the affirmative, what are its children?



These two questions are typically answered using global information regarding
the tree, such as, for example, a function of the data $\mathcal{D}_n$, the
level of the node within the tree, the size of the data set and, more
generally, any parameter connected with the structure of the tree. This
parameter could be, for example, the total number $k$ of cells in a
$k$-partition tree or the penalty term in the pruning of the CART algorithm
(Breiman et al. [5]; see also Gey and Nedelec).

Cellular trees proceed from a different philosophy. In short, a cellular tree
should, at each node, be able to answer questions ① and ② using local
information only, without any help from the other nodes. In other words,
each cell can perform as many operations as it wishes, provided it uses only
the data that are transmitted to it, regardless of the general structure of
the tree. Just imagine that the calculations to be carried out at the nodes
are sent to different computers, possibly asynchronously, and that the system
architecture is so complex that computers do not communicate. Such a
situation may arise, for example, in the context of massive data sets, that
is, when both $n$ and $d$ are astronomical, and no single human and no single
computer can handle this alone. Thus, once a computer receives its data,
it has to make its own decisions ① and ② based on this data subset only,
independently of the others and without knowing anything of the overall
edifice. Once a data set is split, it can be given to another computer for
further splitting, since the remaining data points have no influence. This
greedy mechanism is schematized in Figure 2.

But there is a more compelling reason for making local decisions. A
neurologist seeing twenty patients must make decisions without knowing
anything about the other patients in the hospital that were sent to other
specialists. Neither does he need to know how many other patients there are.
The neurologist's decision, in other words, should only be based on the data
(the patients) in his care.

Figure 2: Schematization of the cell, the computational unit.



2.2 A mathematical model 

The objective of this subsection is to discuss a tentative mathematical model
for cellular tree classifiers. Without loss of generality, we consider binary
tree classifiers based on a class $\mathcal{C}$ of possible Borel subsets of
$\mathbb{R}^d$ that can be used for splits. A typical example of such a class
is the family of all hyperplanes, or the class of all hyperplanes that are
perpendicular to one of the axes. Higher order polynomial splitting surfaces
can be imagined as well.

The class is parametrized by a vector $\sigma \in \mathbb{R}^p$. There is a
splitting function $f(\mathbf{x}, \sigma)$, $\mathbf{x} \in \mathbb{R}^d$,
$\sigma \in \mathbb{R}^p$, such that $\mathbb{R}^d$ is partitioned into
$A = \{\mathbf{x} \in \mathbb{R}^d : f(\mathbf{x}, \sigma) \geq 0\}$ and
$B = \{\mathbf{x} \in \mathbb{R}^d : f(\mathbf{x}, \sigma) < 0\}$. Formally, a
cellular split can be viewed as a family of measurable mappings $\sigma$ from
$(\mathbb{R}^d \times \{0, 1\})^n$ to $\mathbb{R}^p$ (for all $n \geq 1$).
That is, for each possible input size $n$, we have a map. In addition, there
is a family of measurable mappings $\theta$ from
$(\mathbb{R}^d \times \{0, 1\})^n$ to $\{0, 1\}$ that indicate decisions:
$\theta = 1$ indicates that a split should be applied, while $\theta = 0$
corresponds to a decision not to split. In that case, the cell acts as a leaf
node in the tree. Note that $\theta$ and $\sigma$ correspond to the decisions
given in ① and ②.

A cellular binary classification tree is a machine that partitions the space
recursively in the following manner. With each node we associate a subset of
$\mathbb{R}^d$, starting with $\mathbb{R}^d$ for the root node. Let the data
set be $\mathcal{D}_n$. If $\theta(\mathcal{D}_n) = 0$, the root cell is
final, and the space is not split. Otherwise, $\mathbb{R}^d$ is split into

$$A = \{\mathbf{x} \in \mathbb{R}^d : f(\mathbf{x}, \sigma(\mathcal{D}_n)) \geq 0\} \quad \text{and} \quad B = \{\mathbf{x} \in \mathbb{R}^d : f(\mathbf{x}, \sigma(\mathcal{D}_n)) < 0\}.$$

The data $\mathcal{D}_n$ are partitioned into two groups: the first group
contains all $(\mathbf{X}_i, Y_i)$, $i = 1, \dots, n$, for which
$\mathbf{X}_i \in A$, and the second group all others. The groups are sent to
child cells, and the process is repeated.
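
The recursion just described can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the model: splits are taken perpendicular to a coordinate axis, `sigma` returns a hypothetical `(axis, threshold)` pair so that f(x, sigma) = x[axis] - threshold, and the names `grow`, `theta`, and `sigma` are ours. The point to notice is that both callables receive only the data handed to the cell, never the original sample size or the depth of the node.

```python
from typing import Callable, List, Tuple, Union

Point = Tuple[Tuple[float, ...], int]   # (x, y) with y in {0, 1}
Split = Tuple[int, float]               # hypothetical (axis, threshold) pair


def grow(data: List[Point],
         theta: Callable[[List[Point]], int],
         sigma: Callable[[List[Point]], Split]) -> Union[list, dict]:
    """Grow a cellular binary tree; a leaf is simply its list of points."""
    if theta(data) == 0:                # local decision: do not split
        return data
    axis, t = sigma(data)               # local decision: how to split
    # A = {x : f(x, sigma) >= 0}, B = {x : f(x, sigma) < 0},
    # with the assumed form f(x, sigma) = x[axis] - t.
    part_a = [p for p in data if p[0][axis] - t >= 0]
    part_b = [p for p in data if p[0][axis] - t < 0]
    return {"split": (axis, t),
            "A": grow(part_a, theta, sigma),
            "B": grow(part_b, theta, sigma)}
```

Termination is not automatic here: it is the responsibility of the pair (`theta`, `sigma`), exactly as in the discussion of finiteness that follows.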

A priori, there is no reason why this tree should be finite. We will impose 
conditions later on that ensure that with probability 1, the tree is finite for 
all n and for all possible values of the data. For example, this could be 
achieved by hyperplane splits perpendicular to the axes that are forced to 
visit (contain) one of the $\mathbf{X}_i$'s. By insisting that the data point selected
on the boundary be "eaten", i.e., not sent down to the child nodes, one 
reduces the data set by one at each split, thereby ensuring the finiteness of 
the decision tree. We will employ such a (crude) method. 

When $\mathbf{x} \in \mathbb{R}^d$ needs to be classified, we first determine
the unique leaf set $A(\mathbf{x})$ to which $\mathbf{x}$ belongs, and then
take votes among the $\{Y_i : \mathbf{X}_i \in A(\mathbf{x}), i = 1, \dots, n\}$.
Classification proceeds by a majority vote, with the majority deciding the
estimate $g_n(\mathbf{x})$. In case of a tie, we set $g_n(\mathbf{x}) = 0$.
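
As a small illustration of the voting step, the following Python sketch implements the majority rule with ties resolved in favor of class 0. The function name `majority_vote` is ours, and returning 0 for an empty leaf is an assumption that simply follows from treating 0 votes against 0 votes as a tie.

```python
from typing import Sequence


def majority_vote(labels: Sequence[int]) -> int:
    """Return 1 only if class 1 has strictly more votes than class 0."""
    ones = sum(1 for y in labels if y == 1)
    zeros = sum(1 for y in labels if y == 0)
    # A tie (including the empty leaf) is decided in favor of class 0.
    return 1 if ones > zeros else 0
```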

A cellular binary tree classifier is said to be randomized if each node in the 
tree has an independent copy of a uniform [0, 1] random variable associated 
with it, and $\theta$ and $\sigma$ are mappings that have one extra
real-valued component in the input. For example, we could flip an unbiased
coin at each node to decide whether $\theta = 0$ or $\theta = 1$.

Remark 2.1 It is tempting to say that any classifier $g_n$ is a cellular tree
classifier with the following mechanism: Set $\theta = 1$ if we are at the
root, and $\theta = 0$ elsewhere. The root node is split by the classifier
into a set

$$A = \{\mathbf{x} \in \mathbb{R}^d : g_n(\mathbf{x}) = 1\}$$

and its complement, and both child nodes are leaves. However, the decision
to cut can only be a function of the input data, and not the node's position
in the tree, and thus, this is not allowed.

2.3 Are there consistent cellular tree classifiers? 

At first sight, it appears that there are no universally consistent cellular tree 
classifiers. Consider for example complete binary trees with k full levels, 
i.e., there are $2^k$ leaf regions. We can have consistency when $k$ is
allowed to depend upon $n$. An example is the median tree (Devroye, Györfi,
and Lugosi [10, Section 20.3]). When $d = 1$, split by finding the median
element among the $\mathbf{X}_i$'s, so that the child sets have cardinality
given by $\lfloor (n-1)/2 \rfloor$ and $\lceil (n-1)/2 \rceil$, where
$\lfloor \cdot \rfloor$ and $\lceil \cdot \rceil$ are the floor and ceiling
functions. The median itself does stay behind and is not sent down to the
subtrees, with an appropriate convention for breaking cell boundaries as well
as empty cells. Keep doing this for $k$ rounds; in $d$ dimensions, one can
either rotate through the coordinates for median splitting, or randomize by
selecting uniformly at random a coordinate to split orthogonally.
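
The one-dimensional median split can be sketched as follows. This is a minimal illustration assuming distinct values and a nonempty input; choosing the lower median (index ⌊(n-1)/2⌋ of the sorted sample) is our convention, and the name `median_split` is ours. The median itself is returned separately because it stays behind in the cell.

```python
from typing import List, Tuple


def median_split(xs: List[float]) -> Tuple[List[float], float, List[float]]:
    """Split a nonempty sample at its (lower) median, which is not sent down."""
    s = sorted(xs)
    m = (len(s) - 1) // 2          # index of the lower median
    # Left child has floor((n-1)/2) points, right child has ceil((n-1)/2).
    return s[:m], s[m], s[m + 1:]
```

For instance, a sample of 8 points yields children of sizes 3 and 4, and a sample of 7 points yields two children of size 3.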

This rule is known to be consistent as soon as the marginal distributions of
$\mathbf{X}$ are nonatomic, provided $k \to \infty$ and $k 2^k / n \to 0$.
However, this is not a cellular tree classifier. While we can indeed specify
$\sigma$, it is impossible to define $\theta$ because $\theta$ cannot be a
function of the global value of $n$. In other words, if we were to apply
median splitting and decide to split for a fixed $k$, then the leaf nodes
would all correspond to a fixed proportion of the data points. It is clear
that the decisions in the leaves are off with a fair probability if we have,
for example, $Y$ independent of $\mathbf{X}$ and $\mathbb{P}\{Y = 1\} = 1/2$.
Thus, we cannot create a cellular tree classifier in this manner.

In view of the preceding discussion, it seems paradoxical that there indeed ex- 
ist universally consistent cellular tree classifiers. (We note here that we abuse 
the word "universal" — we will assume throughout, to keep the discussion at 
a manageable level, that the marginal distributions of X are nonatomic. But 
no other conditions on the joint distribution of (X, Y) are imposed.) Our 
first construction, which is presented in Section 3, follows the median tree 
principle and uses randomization. In a second construction (Section 4) we 
derandomize, and exploit the idea that each cell is allowed to explore its own 
subtrees, thereby anticipating the decisions of its children. For the sake of 
clarity, proofs of the most technical results are gathered in Section 5. 

3 A randomized cellular tree classifier 

From now on, to keep things simple, it is assumed that the marginal
distributions of $\mathbf{X}$ are nonatomic. The cellular splitting method
$\sigma$ described in this section mimics the median tree classifier
discussed above. We first choose a dimension to cut, uniformly at random from
the $d$ dimensions, as rotating through the dimensions by level number would
violate the cellular condition. The selected dimension is then split at the
data median, just as in the classical median tree. Repeating this for $k$
levels of nodes leads to $2^k$ leaf regions. On any path of length $k$ to one
of the $2^k$ leaves, we have a deterministic sequence of cardinalities
$n_0 = n$ (root), $n_1, n_2, \dots, n_k$. We always have
$n_i/2 - 1 \leq n_{i+1} \leq n_i/2$. Thus, by induction, one easily shows
that, for all $i$,

$$\frac{n}{2^i} - 2 \leq n_i \leq \frac{n}{2^i}.$$

In particular, each leaf has at least $\max(n/2^k - 2, 0)$ points and at most
$n/2^k$.
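
The cardinality bounds can be checked numerically. The sketch below tracks only cell sizes, not the data, using the fact that a cell of size m has children of sizes ⌊(m-1)/2⌋ and ⌈(m-1)/2⌉ = ⌊m/2⌋. The function name `sizes_at_level` and the convention that an empty cell has two empty children are our assumptions.

```python
from typing import List


def sizes_at_level(n: int, k: int) -> List[int]:
    """Sizes of the 2**k cells after k rounds of median splitting."""
    level = [n]
    for _ in range(k):
        nxt: List[int] = []
        for m in level:
            if m == 0:
                nxt += [0, 0]                    # assumed convention for empty cells
            else:
                nxt += [(m - 1) // 2, m // 2]    # the median point is removed
        level = nxt
    return level


# Every cell size s at level k should satisfy n/2**k - 2 <= s <= n/2**k.
n, k = 1000, 5
assert all(n / 2**k - 2 <= s <= n / 2**k for s in sizes_at_level(n, k))
```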

Remark 3.1 The problem of atoms in the coordinates can be dealt with sep- 
arately, but still within the cellular framework. The particularity is that the 
threshold for splitting may now be at a position at which one or more data 
values occur. This leaves two sets that may differ in size by more than one. 
The atoms in the distribution of $\mathbf{X}$ can never be separated, but that is as it
should be. We leave it to the reader to adapt the subsequent arguments to the
case of atomic distributions. 

The novelty is in the choice of the decision function. This function ignores
the data altogether and uses a randomized decision that is based on the size
of the input. More precisely, consider a nonincreasing function
$\varphi : \mathbb{N} \to (0, 1]$ with $\varphi(0) = \varphi(1) = 1$. Then, if
$U$ is the uniform $[0, 1]$ random variable associated with the cell, and the
input to the cell is $\mathcal{D}_n$,

$$\theta(\mathcal{D}_n, U) = \mathbf{1}_{[U > \varphi(n)]}.$$

In this manner, we obtain a possibly infinite randomized binary tree
classifier. Splitting occurs with probability $1 - \varphi(n)$ on inputs of
size $n$. Note that no attempt is made to split empty sets or singleton sets.
For consistency, we need to look at the random leaf region to which
$\mathbf{X}$ belongs. This is roughly equivalent to studying the distance
from that cell to the root of the tree.

In the sequel, the notation $u_n = o(v_n)$ (respectively,
$u_n = \omega(v_n)$) means that $u_n/v_n \to 0$ (respectively,
$v_n/u_n \to 0$) as $n \to \infty$. Many choices $\varphi(n) = o(1)$, but not
all, will do for us. The next lemma makes things more precise.

Lemma 3.1 Let $\beta \in (0, 1)$. Define

$$\varphi(n) = \begin{cases} 1 & \text{if } n < 3 \\ 1/\log^{\beta} n & \text{if } n \geq 3. \end{cases}$$

Let $K(\mathbf{X})$ denote the random path distance between the cell of
$\mathbf{X}$ and the root of the tree. Then

$$\lim_{n \to \infty} \mathbb{P}\{K(\mathbf{X}) \geq k_n\} = \begin{cases} 0 & \text{if } k_n = \omega(\log^{\beta} n) \\ 1 & \text{if } k_n = o(\log^{\beta} n). \end{cases}$$
Proof of Lemma 3.1. Let us recall that, at level $k$, each cell of the
underlying median tree contains at least $\max(n/2^k - 2, 0)$ points and at
most $n/2^k$. Since the function $\varphi(\cdot)$ is nonincreasing, the first
result follows from this:

$$\mathbb{P}\{K(\mathbf{X}) \geq k_n\} \leq \prod_{i=0}^{k_n - 1} \left(1 - \varphi(\lfloor n/2^i \rfloor)\right) \leq \exp\left(-\sum_{i=0}^{k_n - 1} \varphi(\lfloor n/2^i \rfloor)\right) \leq \exp\left(-k_n \varphi(n)\right).$$

The second statement follows from

$$\mathbb{P}\{K(\mathbf{X}) < k_n\} \leq \sum_{i=0}^{k_n - 1} \varphi\left(\lceil n/2^i - 2 \rceil\right) \leq k_n \varphi\left(\lceil n/2^{k_n} \rceil\right),$$

valid for all $n$ large enough since $n/2^{k_n} \to \infty$ as
$n \to \infty$. ■

Lemma 3.1, combined with the median tree consistency result of Devroye,
Györfi, and Lugosi [10], suffices to establish consistency of the randomized
cellular tree classifier.



Theorem 3.1 Let $\beta$ be a real number in $(0, 1)$. Define

$$\varphi(n) = \begin{cases} 1 & \text{if } n < 3 \\ 1/\log^{\beta} n & \text{if } n \geq 3. \end{cases}$$

Let $g_n$ be the associated randomized cellular binary tree classifier. Assume
that the marginal distributions of $\mathbf{X}$ are nonatomic. Then the
classification rule $g_n$ is consistent:

$$\lim_{n \to \infty} \mathbb{E} L(g_n) = L^*.$$
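
To make the randomized stopping rule concrete, here is a minimal Python sketch of the function used in the theorem and of the resulting split decision. Taking `log` to be the natural logarithm is our assumption, as are the function names; `path_depth` follows a single root-to-leaf path in which, for simplicity, the cell size always drops to ⌊(m-1)/2⌋.

```python
import math
import random


def phi(n: int, beta: float) -> float:
    """phi(n) = 1 for n < 3 and 1 / log(n)**beta for n >= 3 (natural log assumed)."""
    return 1.0 if n < 3 else 1.0 / math.log(n) ** beta


def theta(n: int, u: float, beta: float) -> int:
    """Split (return 1) exactly when u > phi(n); u plays the role of U."""
    return 1 if u > phi(n, beta) else 0


def path_depth(n: int, beta: float, rng: random.Random) -> int:
    """Number of splits met along one path before the cell declares itself a leaf."""
    depth = 0
    while theta(n, rng.random(), beta) == 1:
        n = (n - 1) // 2           # size of the smaller child after a median split
        depth += 1
    return depth
```

Since `phi` equals 1 on inputs of size below 3 and `rng.random()` never exceeds 1, the loop always stops; no cell of size 0, 1, or 2 is ever split.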

Proof of Theorem 3.1. Cells correspond in a natural way to sets of
$\mathbb{R}^d$. So, we can and will speak of a cell $A$, where
$A \subset \mathbb{R}^d$. The number of data points in $A$ is denoted by
$N(A)$:

$$N(A) = \sum_{i=1}^{n} \mathbf{1}_{[\mathbf{X}_i \in A]}.$$

By $\mathrm{diam}(A)$ we mean the diameter of the cell $A$, i.e., the maximal
distance between two points of $A$. We recall a general consistency theorem
for partitioning classifiers whose cell design depends on the
$\mathbf{X}_i$'s only (Devroye, Györfi, and Lugosi [10, Theorem 6.1]).
According to this theorem, such a classifier is consistent if both
1. $\mathrm{diam}(A(\mathbf{X})) \to 0$ in probability as $n \to \infty$, and

2. $N(A(\mathbf{X})) \to \infty$ in probability as $n \to \infty$,

where $A(\mathbf{X})$ is the cell of the random partition containing $\mathbf{X}$.



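Condition 2 follows from Lemma 3.1. Notice that

$$N(A(\mathbf{X})) \geq \left(\frac{n}{2^{K(\mathbf{X})}} - 2\right) \mathbf{1}_{[K(\mathbf{X}) < \log^{(\beta+1)/2} n]} \geq \left(\frac{n}{2^{\log^{(\beta+1)/2} n}} - 2\right) \mathbf{1}_{[K(\mathbf{X}) < \log^{(\beta+1)/2} n]}.$$

Therefore, by Lemma 3.1, $N(A(\mathbf{X})) \to \infty$ in probability as
$n \to \infty$.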

To show that $\mathrm{diam}(A(\mathbf{X})) \to 0$ in probability, observe
that on a path of length $K(\mathbf{X})$, the number of times the first
dimension is cut is binomial $(K(\mathbf{X}), 1/d)$. This tends to infinity
in probability. Following the proof of Theorem 20.2 in [10], the diameter of
the cell of $\mathbf{X}$ tends to 0 in probability with $n$. Details are left
to the reader. ■

Let us finally take care of the randomization. Can one do without
randomization? The hint to the solution of that enigma is in the hypothesis
that the data elements in $\mathcal{D}_n$ are i.i.d. The median classifier
does not use the ordering in the data. Thus, one can use the randomness
present in the permutation of the observations, e.g., the $\ell$-th
components of the $\mathbf{X}_i$'s can form $n!$ permutations if ties do not
occur. This corresponds to $(1 + o(1)) n \log_2 n$ independent fair coin
flips, which are at our disposal. Each decision to split requires on average
at most 2 independent bits. The selection of a random direction to cut
requires no more than $1 + \log_2 d$ independent bits. Since the total tree
size is, with probability tending to 1, $O(2^{\log^{\beta+\varepsilon} n})$
for any $\varepsilon > 0$, a fact that follows with a bit of work from
summing the expected number of nodes at each level, the total number of bits
required to carry out all computations is

$$O\left((3 + \log_2 d)\, 2^{\log^{\beta+\varepsilon} n}\right),$$

which is orders of magnitude smaller than $n$ provided that
$\beta + \varepsilon < 1$. Thus, there is sufficient randomness at hand to do
the job. How it is actually implemented is another matter, as there is some
inevitable dependence between the data sets that correspond to cells and the
data sets that correspond to their children. We will not worry about the
finer details of this in the present paper.
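
A rough numerical comparison of the two bit budgets may help fix ideas. The sketch below is only indicative: it ignores the constant hidden in the O(.) bound on the tree size, takes `log` to be the natural logarithm, and charges each node 2 bits for the split decision plus 1 + log2(d) bits for the direction; the function names are ours.

```python
import math


def available_bits(n: int) -> float:
    """log2(n!): bits carried by a uniformly random permutation of n items."""
    return math.lgamma(n + 1) / math.log(2)


def required_bits(n: int, d: int, beta: float, eps: float) -> float:
    """(3 + log2 d) bits per node times a tree size of 2**(log(n)**(beta + eps))."""
    tree_size = 2 ** (math.log(n) ** (beta + eps))
    return (3 + math.log2(d)) * tree_size


n, d = 10**6, 10
print(required_bits(n, d, beta=0.5, eps=0.1), available_bits(n))
```

With these illustrative values the requirement is a few hundred bits, far below n, while the permutation supplies well over n bits.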
Remark 3.2 For more on random tree models and their analyses, see the
texts of Drmota and of Flajolet and Sedgewick. Additional material
on information theory and bit complexity can be found in the monograph by
Cover and Thomas.

4 A non-randomized cellular tree classifier 

The cellular tree classifier that we consider in this section is more
sophisticated and autonomous, in the sense that it does not rely on any
randomization scheme. It partitions the data recursively as follows. With
each node we associate a subset of $\mathbb{R}^d$, starting with
$\mathbb{R}^d$ for the root node. We first consider a full $2^d$-ary tree
(see Figure 3 for an illustration in dimension 2), with the cuts decided in
the following manner. The dimensions are ordered once and for all from 1 to
$d$. At the root, we find the median of (the projection of) the $n$ data
points in direction 1, then on each of the two subsets, we find the median in
direction 2, then on each of the four subsets, we find the median in
direction 3, and so forth. A split, contrary to our discussion thus far, is
into $2^d$ parts, not two parts. This corresponds to Bentley's k-d tree [4].
Repeating this splitting for $k$ levels of nodes leads to $2^{dk}$ leaf
regions, each having at least $\max(n/2^{dk} - 2, 0)$ points and at most
$n/2^{dk}$.
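
One level of this 2^d-ary split can be sketched as d successive rounds of median cuts, one per coordinate. The code below is a minimal illustration assuming distinct coordinate values; the name `split_2d_ary`, the choice of the lower median, and the convention that an empty cell produces two empty children are ours.

```python
from typing import List, Sequence


def split_2d_ary(points: List[Sequence[float]], d: int) -> List[List[Sequence[float]]]:
    """Return the 2**d child cells obtained by cutting at the median in
    directions 0, 1, ..., d-1 in turn; each cut drops its median point."""
    cells = [list(points)]
    for axis in range(d):
        nxt = []
        for cell in cells:
            if not cell:
                nxt += [[], []]                  # assumed convention for empty cells
                continue
            s = sorted(cell, key=lambda p: p[axis])
            m = (len(s) - 1) // 2                # index of the lower median
            nxt += [s[:m], s[m + 1:]]            # the point s[m] stays behind
        cells = nxt
    return cells
```

For 16 points in the plane, the four children have sizes 3, 3, 3, and 4: three median points are removed (one at the first cut, two at the second).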




Figure 3: A full $2^d$-ary tree in dimension $d = 2$.

This procedure is equivalent to $dk$ consecutive binary splits at the median,
where we rotate through the dimensions. However, in our cellular set-up,
such rotations through the dimensions are impossible, and this forces us to
employ this equivalent strategy. Note, therefore, that the split parameter
$\sigma$ is an extension of the binary classifier split $\sigma$; one could
consider it as a vector of dimension $2^d - 1$, as we need to specify
$2^d - 1$ coordinate positions to fully specify a partition into $2^d$
regions. It remains to specify a stopping rule $\theta$ which respects the
cellular constraint. To this aim, we need some additional notation.



Remark 4.1 By the very construction of the tree, at each node, the median
itself does stay behind and is not sent down to the subtrees. From a
topological point of view, this means that, in the partition building, each
cell $A$ and its $2^d$ child cells $A_1, \dots, A_{2^d}$ are considered as
open hyperrectangles. Thus, for classification, we would strictly speaking
not be able to classify any data that fall "on the border" between
$A_1, \dots, A_{2^d}$. This is an unimportant detail for the calculations
since the marginal distributions of $\mathbf{X}$ are nonatomic. In practice,
this issue can be solved with an appropriate convention to break the boundary
ties.

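If $A$ is any cell of the full $2^d$-ary tree defined above, we let $N(A)$ be
the number of $\mathbf{X}_i$'s falling in $A$, and estimate the quality of the
majority vote classifier at this node by

$$\hat{L}_n(A) = \frac{1}{N(A)} \min\left(\sum_{i=1}^{n} \mathbf{1}_{[\mathbf{X}_i \in A, Y_i = 1]}, \sum_{i=1}^{n} \mathbf{1}_{[\mathbf{X}_i \in A, Y_i = 0]}\right).$$

(Throughout, we adopt the convention 0/0 = 0.)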
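
In code, this estimate is just the fraction of minority labels in the cell. The following Python sketch takes the list of labels of the points falling in A; the name `cell_error` is ours, and the empty case implements the convention 0/0 = 0.

```python
from typing import Sequence


def cell_error(labels: Sequence[int]) -> float:
    """Empirical error of the majority vote in a cell, given its 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0                      # convention 0/0 = 0
    ones = sum(labels)
    return min(ones, n - ones) / n
```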

Remark 4.2 Each cut at the median eliminates 1 data point. Thus, given a
cell $A$, the construction of its offspring $k$ generations later rules out
at most $1 + \dots + 2^{dk-1} = 2^{dk} - 1$ observations. In particular, if
$A$ has cardinality $N(A)$, then, $k$ generations later, its offspring
$A_1, \dots, A_{2^{dk}}$ have a total combined cardinality at least
$N(A) - (2^{dk} - 1)$.

Fix a positive real parameter $\alpha$ and define the nonnegative integer
$k^+$ by

$$k^+ = \lfloor \alpha \log_2 (N(A) + 1) \rfloor,$$

where, for simplicity, we drop the dependency of $k^+$ upon $A$ and $\alpha$.
Finally, letting $\mathcal{P}_{k^+}(A)$ be the $2^{dk^+}$ leaf regions
(terminal nodes) of the full $2^d$-ary tree rooted at $A$ of height $k^+$, we
set

$$\hat{L}_n(A, k^+) = \sum_{A_j \in \mathcal{P}_{k^+}(A)} \hat{L}_n(A_j) \frac{N(A_j)}{N(A)}.$$

The quantity $\hat{L}_n(A, k^+)$ is interpreted as the total (normalized)
error of a majority vote over the offspring of $A$ living $k^+$ generations
later. It should be stressed that both $\hat{L}_n(A)$ and
$\hat{L}_n(A, k^+)$ may be evaluated on the basis of the data points falling
in $A$ only (no matter what the rest of the tree looks like), thereby
respecting the cellular constraint.

Now, let $\beta$ be a positive real parameter. With this notation, the
stopping rule ① takes the following simple form:

① Put $\theta = 0$ if

$$\left| \hat{L}_n(A) - \hat{L}_n(A, k^+) \right| \leq \left(\frac{1}{N(A) + 1}\right)^{\beta}.$$



In other words, at each cell, the algorithm compares the actual classification
error with the total error of the cell offspring $k^+$ generations later. The
bounded lookahead principle we rely on is well developed in the artificial
intelligence literature; see, for example, Pearl's book [33] on probabilistic
reasoning. If the difference is below some well-chosen threshold, then
the cellular classification procedure stops and the node returns a terminal
signal. Otherwise, the node outputs $2^d$ sets of data, and the process
continues recursively. The protocol stops once all nodes have returned a
terminal signal, and final decisions are taken by majority vote. Thus, for
$\mathbf{x}$ falling in a terminal node $A$, the rule is as usual

$$g_n(\mathbf{x}) = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} \mathbf{1}_{[\mathbf{X}_i \in A, Y_i = 1]} > \sum_{i=1}^{n} \mathbf{1}_{[\mathbf{X}_i \in A, Y_i = 0]} \\ 0 & \text{otherwise.} \end{cases}$$
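
The lookahead stopping rule can be sketched in Python for the special case d = 1, where each 2^d-ary split is a single median cut. This is an illustration under our own assumptions, not the authors' implementation: a cell is a list of (x, y) pairs with distinct x values, the lower median is used, the rule is read as comparing the absolute difference of the two errors with (1/(N(A)+1))^beta, and all function names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, int]               # (x, y) with y in {0, 1}


def cell_error(cell: List[Point]) -> float:
    """Fraction of minority labels in the cell (0/0 = 0 by convention)."""
    n = len(cell)
    if n == 0:
        return 0.0
    ones = sum(y for _, y in cell)
    return min(ones, n - ones) / n


def descendants(cell: List[Point], k: int) -> List[List[Point]]:
    """The 2**k cells obtained after k rounds of median cuts (d = 1)."""
    cells = [sorted(cell)]
    for _ in range(k):
        nxt = []
        for c in cells:
            if not c:
                nxt += [[], []]
            else:
                m = (len(c) - 1) // 2
                nxt += [c[:m], c[m + 1:]]        # the median point is dropped
        cells = nxt
    return cells


def theta(cell: List[Point], alpha: float, beta: float) -> int:
    """Return 0 (stop) if looking k_plus generations ahead changes the error
    by at most (1 / (N + 1))**beta, and 1 (split) otherwise."""
    n = len(cell)
    if n == 0:
        return 0
    k_plus = math.floor(alpha * math.log2(n + 1))
    look = sum(cell_error(c) * len(c) for c in descendants(cell, k_plus)) / n
    return 0 if abs(cell_error(cell) - look) <= (1.0 / (n + 1)) ** beta else 1
```

Only the points of the cell enter the computation, so the decision respects the cellular constraint. With alpha = beta = 0.2 (so that 1 - d*alpha - 2*beta > 0 for d = 1), a cell whose left half is all 0s and right half is all 1s is split, whereas a cell with a single class is declared a leaf.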

In the next section, we prove the following theorem. 

Theorem 4.1 Let $g_n$ be the cellular tree classifier defined above, with
$1 - d\alpha - 2\beta > 0$. Assume that the marginal distributions of
$\mathbf{X}$ are nonatomic. Then the classification rule $g_n$ is consistent:

$$\lim_{n \to \infty} \mathbb{E} L(g_n) = L^*.$$

From a technical point of view, this theorem poses a challenge, as there are 
no conditions on the distribution, and the rectangular cells do in general 
not shrink to zero. In fact, it is easy to find distributions of X for which 
the maximal cell diameter does not tend to zero in probability, even if all 
is restricted to the unit cube. For distributions with infinite support, there 
are always cells of infinite diameter. This observation implies that classical 
consistency proofs, that often use differentiation of measure arguments or rely 
on asymptotic justifications related to Lebesgue's density theorem, cannot 
be applied. The proof uses global arguments instead. 
For partitions that do not depend upon the $Y$ values in the data, consistency
can be shown by relatively simple means, following for example the arguments
given in Devroye, Györfi, and Lugosi [10]. However, our partition and tree
depend upon the $Y$ values in the data. Within the constraints imposed by the
cellular model, we believe that this is the first (and only) proof of universal
consistency of a $Y$-dependent cellular tree classifier. On the other hand, we
have proposed a model that is a priori too simple to be competitive. There 
are choices of parameters to be made, and there is absolutely no minimax 
theory of lower bounds for the rate with which cellular tree classifiers can 
approach the Bayes error. The work ahead is enormous and the road arduous. 



5 Proof of Theorem 4.1 



5.1 Notation and preliminary results 

We start with some notation (see Figure 4). For each level $k \geq 0$, we
denote by $\mathcal{P}_k$ the partition represented by the leaves of the
underlying full $2^d$-ary median-type tree. This partition has $2^{dk}$ cells
and its construction depends on the $\mathbf{X}_i$'s only. The labels $Y_i$'s
do not play a role in the building of $\mathcal{P}_k$, though they are
involved in making the decision whether to cut a cell or not.

For each $A_j \in \mathcal{P}_k$, we let $N(A_j)$ be the number of
$\mathbf{X}_i$'s falling in $A_j$ and note that
$\sum_{j=1}^{2^{dk}} N(A_j) \leq n$, with a strict inequality as soon as
$k > 0$ (see Remark 4.2). For each level $k$, $A_k(\mathbf{X})$ denotes the
cell of the partition $\mathcal{P}_k$ into which $\mathbf{X}$ falls, and
$N(A_k(\mathbf{X}))$ the number of data points falling in this set.

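We let $\mu$ be the distribution of $\mathbf{X}$ and $\eta$ the regression
function of $Y$ on $\mathbf{X}$. More precisely, for any Borel-measurable set
$A \subset \mathbb{R}^d$,

$$\mu(A) = \mathbb{P}\{\mathbf{X} \in A\}$$

and, for any $\mathbf{x} \in \mathbb{R}^d$,

$$\eta(\mathbf{x}) = \mathbb{P}\{Y = 1 \mid \mathbf{X} = \mathbf{x}\} = \mathbb{E}[Y \mid \mathbf{X} = \mathbf{x}].$$

It is known that the Bayes error is

$$L^* = \int_{\mathbb{R}^d} \min\left(\eta(\mathbf{z}), 1 - \eta(\mathbf{z})\right) \mu(d\mathbf{z}).$$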

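Let us recall that, for any cell $A$,

$$\hat{L}_n(A) = \frac{1}{N(A)} \min\left(\sum_{i=1}^{n} \mathbf{1}_{[\mathbf{X}_i \in A, Y_i = 1]}, \sum_{i=1}^{n} \mathbf{1}_{[\mathbf{X}_i \in A, Y_i = 0]}\right).$$

Figure 4: Some key notation.

Also, for every $k \geq 0$,

$$\hat{L}_n(A, k) = \sum_{A_j \in \mathcal{P}_k(A)} \hat{L}_n(A_j) \frac{N(A_j)}{N(A)},$$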



where Vk(A) is the full 2 d -ary median- type tree rooted at A of height k. 
the population level, we set 

L*(A) = -L min Qf ^(^^(dz), jf (1 - ^(z)) /i(dz)) 



and 



L*(A,k)= J2 L *( A > 



For all k > 0, we shall also need the quantity 

LJj = E[L*(Afc(X))]. 
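The empirical quantities above can be illustrated by a short Python sketch. It is only an illustration under stated assumptions: the function names `empirical_error` and `split_error` are hypothetical, a cell is represented simply by the list of 0/1 labels of the data points it contains, and the value returned for an empty cell is a convention chosen here, not one taken from the text.

```python
from typing import Sequence


def empirical_error(labels: Sequence[int]) -> float:
    """Majority-vote error estimate of one cell: min(#ones, #zeros) / N."""
    n_points = len(labels)
    if n_points == 0:
        return 0.0  # convention for empty cells (an assumption of this sketch)
    ones = sum(labels)
    return min(ones, n_points - ones) / n_points


def split_error(parent_size: int, children: Sequence[Sequence[int]]) -> float:
    """Weighted error after a split: sum of error(A_j) * N(A_j) / N(A)."""
    total = 0.0
    for child in children:
        total += empirical_error(child) * len(child) / parent_size
    return total


# A cell with 7 points; the pivot (label 1) is removed by the split,
# so the two children only hold 6 points in total.
parent = [1, 1, 0, 1, 0, 0, 1]
children = [[1, 1, 0], [0, 0, 1]]
print(empirical_error(parent))             # error of the majority rule on the cell
print(split_error(len(parent), children))  # error after one further split
```

Because the pivots are not passed down to the children, the weights $N(A_j)/N(A)$ sum to less than one, which is why the weighted error after a split can only be compared with the parent's error up to a small correction term.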
Note that whenever $A = A(\mathbf{X}_1, \ldots, \mathbf{X}_n)$ is a random cell, we take the liberty 
to abbreviate $\int_A d\mu$ by $\mu(A)$ throughout the manuscript, since this should 
cause no confusion. We write for instance 

$$L^*_k = \mathbb{E}\big[\mathbb{E}[L^*(A_k(\mathbf{X})) \mid \mathbf{X}_1, \ldots, \mathbf{X}_n]\big] = \mathbb{E}\Big[\sum_{A \in \mathcal{P}_k} L^*(A)\, \mu(A)\Big]$$

instead of 

$$\mathbb{E}\Big[\sum_{A \in \mathcal{P}_k} L^*(A) \int_A d\mu\Big].$$



Our proof starts with some easy but important facts. 
Fact 5.1 

(i) For all levels $k' \ge k \ge 0$, 

$$L^* \le L^*_{k'} \le L^*_k.$$



(ii) For each cell A and each level k > 0, 



L n (A,k) < L n (A) + ^-^yV(A)>o] 
(iii) For each cell $A$ and all levels $k' \ge k \ge 0$, 



L n (A, k') < L n (A, k) + jj^l [N (A)>o]- 

(iv) For each cell $A$ and all levels $k, k' \ge 0$, 

$$\mathbb{E}[L^*(A_k(\mathbf{X}), k')] = L^*_{k+k'}.$$

In particular, for $k'' \ge k' \ge 0$, 

$$L^* \le \mathbb{E}[L^*(A_k(\mathbf{X}), k'')] \le \mathbb{E}[L^*(A_k(\mathbf{X}), k')].$$

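Proof Proof of statement (i) is based on the nesting of the partitions. To 
establish (ii), observe that, by definition, 

$$\hat{L}_n(A) = \frac{1}{2} - \frac{1}{2N(A)} \Big| N(A) - 2 \sum_{i=1}^n \mathbf{1}_{[\mathbf{X}_i \in A, Y_i = 1]} \Big|$$

and 

$$\hat{L}_n(A, k) = \frac{1}{2N(A)} \sum_{A_j \in \mathcal{P}_k(A)} N(A_j) - \frac{1}{2N(A)} \sum_{A_j \in \mathcal{P}_k(A)} \Big| N(A_j) - 2 \sum_{i=1}^n \mathbf{1}_{[\mathbf{X}_i \in A_j, Y_i = 1]} \Big|$$
$$\le \frac{1}{2} - \frac{1}{2N(A)} \sum_{A_j \in \mathcal{P}_k(A)} \Big| N(A_j) - 2 \sum_{i=1}^n \mathbf{1}_{[\mathbf{X}_i \in A_j, Y_i = 1]} \Big|.$$

But, by the triangle inequality and Remark 4.2, 

$$\Big| N(A) - 2 \sum_{i=1}^n \mathbf{1}_{[\mathbf{X}_i \in A, Y_i = 1]} \Big| \le \sum_{A_j \in \mathcal{P}_k(A)} \Big| N(A_j) - 2 \sum_{i=1}^n \mathbf{1}_{[\mathbf{X}_i \in A_j, Y_i = 1]} \Big| + 2^{dk} - 1.$$

This proves (ii). Proof of (iii) is similar. To show (iv), just note that 

$$\mathbb{E}[L^*(A_k(\mathbf{X}), k')] = \mathbb{E}\Big[\sum_{A \in \mathcal{P}_k} \sum_{A_j \in \mathcal{P}_{k'}(A)} L^*(A_j)\, \mu(A_j)\Big] = \mathbb{E}[L^*(A_{k+k'}(\mathbf{X}))] = L^*_{k+k'}.$$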



The next two propositions will be decisive in our analysis. Proposition 5.1 
asserts that the diameter of $A_k(\mathbf{X})$ tends to 0 in probability, provided $k$ (as a 
function of $n$) tends sufficiently slowly to infinity. Proposition 5.2 introduces 
a particular level $k^*_n$ which will play a central role in the proof of Theorem 
4.1. 

Proposition 5.1 Assume that the marginal distributions of $\mathbf{X}$ are nonatomic. 
Then, if 

$$k \to \infty \quad \text{and} \quad \frac{k 2^{dk}}{n} \to 0,$$

one has 

$$\mathrm{diam}(A_k(\mathbf{X})) \to 0 \quad \text{in probability as } n \to \infty.$$
Proof of Proposition 5.1 Median-split trees are analyzed in some detail 
in the monograph by Devroye, Györfi, and Lugosi [10, Section 20.3]. Starting 
on page 323, it is shown that the diameter of a randomly selected cell 
tends to 0 in probability. The adaptation to our $2^d$-ary median-type trees is 
straightforward. However, a few remarks are in order. Section 20.3 of that 
book assumes that all marginals are uniform. This can also be the set-up 
for us, because our rule is invariant under monotone transformations of the 
axes. Note however that it is crucial that splits are made exactly at data 
points for this property to be true. Also, the proofs in Section 20.3 of [10] 
assume $d = 2$, but are clearly true for general $d$. The only condition for the 
diameter result is that of Theorem 20.2, page 323: 

$$k \to \infty \quad \text{and} \quad \frac{k 2^{dk}}{n} \to 0.$$

The second condition is only necessary to make sure that the data medians 
do not run too far away from the true distributional medians. ■ 
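To get a feeling for the growth condition of Proposition 5.1, the following Python sketch evaluates $k 2^{dk}/n$ numerically. The choice $k = \lfloor \alpha \log_2 n \rfloor$ with $\alpha < 1/d$ is an assumption made here purely for illustration, and the helper name `split_budget_ratio` is hypothetical.

```python
import math


def split_budget_ratio(n: int, d: int, alpha: float) -> float:
    """Return k * 2**(d*k) / n for the illustrative choice k = floor(alpha * log2(n))."""
    k = math.floor(alpha * math.log2(n))
    return k * 2 ** (d * k) / n


# With d = 2 and alpha = 0.25 < 1/d, the ratio shrinks as n grows,
# while k itself keeps increasing (slowly).
for exponent in (8, 20, 40):
    n = 2 ** exponent
    print(n, split_budget_ratio(n, d=2, alpha=0.25))
```

For this choice the ratio behaves like $\alpha \log_2(n)\, n^{d\alpha - 1}$, so any fixed $\alpha < 1/d$ satisfies both requirements of the proposition.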

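Proposition 5.2 Let $\psi(n, k)$ be the function defined for all $n \ge 1$ and $k \ge 0$ 
by 

$$\psi(n, k) = L^*_k - L^*.$$

(i) Let $\{k_n\}_{n \ge 1}$ be a sequence of nonnegative integers such that $k_n \to \infty$ 
and $k_n 2^{dk_n}/n \to 0$. Then 

$$\psi(n, k_n) \to 0 \quad \text{as } n \to \infty.$$

(ii) Assume that $\alpha \in (0, 1/d)$ and, for fixed $n$, set 

$$k^*_n = \min\Big\{ \ell \ge 0 : \psi(n, \ell) < \sqrt{\Big(\frac{2^{d\ell}}{n}\Big)^{1 - d\alpha}} \Big\}.$$

Then 

$$\frac{2^{dk^*_n}}{n} \to 0 \quad \text{as } n \to \infty.$$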

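Proof of Proposition 5.2 At first we note, according to Fact 5.1(i), 
that for all $n \ge 1$ and $k \ge 0$, $\psi(n, k) \ge 0$. For $\mathbf{x} \in \mathbb{R}^d$, introduce 

$$\bar{\eta}_n(\mathbf{x}) = \frac{1}{\mu(A_{k_n}(\mathbf{x}))} \int_{A_{k_n}(\mathbf{x})} \eta(\mathbf{z})\, \mu(d\mathbf{z}).$$

With this notation, 

$$\psi(n, k_n) = \mathbb{E}[L^*(A_{k_n}(\mathbf{X}))] - L^* \le \mathbb{E}|\eta(\mathbf{X}) - \bar{\eta}_n(\mathbf{X})| + \mathbb{E}|(1 - \eta(\mathbf{X})) - (1 - \bar{\eta}_n(\mathbf{X}))|.$$

Let us prove that the first of the two terms above tends to 0 as $n$ tends to 
infinity (the second term is handled similarly). To this aim, fix $\varepsilon > 0$ and 
find a uniformly continuous function $\eta_\varepsilon$ on a bounded set $C$ and vanishing off 
$C$ so that $\mathbb{E}|\eta(\mathbf{X}) - \eta_\varepsilon(\mathbf{X})| \le \varepsilon$. Clearly, by the triangle inequality, 

$$\mathbb{E}|\eta(\mathbf{X}) - \bar{\eta}_n(\mathbf{X})| \le \mathbb{E}|\eta(\mathbf{X}) - \eta_\varepsilon(\mathbf{X})| + \mathbb{E}|\eta_\varepsilon(\mathbf{X}) - \bar{\eta}_{n,\varepsilon}(\mathbf{X})| + \mathbb{E}|\bar{\eta}_{n,\varepsilon}(\mathbf{X}) - \bar{\eta}_n(\mathbf{X})| \stackrel{\text{def}}{=} \mathrm{I} + \mathrm{II} + \mathrm{III},$$

where 

$$\bar{\eta}_{n,\varepsilon}(\mathbf{x}) = \frac{1}{\mu(A_{k_n}(\mathbf{x}))} \int_{A_{k_n}(\mathbf{x})} \eta_\varepsilon(\mathbf{z})\, \mu(d\mathbf{z}).$$

By choice of $\eta_\varepsilon$, one has $\mathrm{I} \le \varepsilon$. Next, note that 

$$\mathrm{II} \le \mathbb{E}\Big[ \frac{1}{\mu(A_{k_n}(\mathbf{X}))} \int_{A_{k_n}(\mathbf{X})} |\eta_\varepsilon(\mathbf{X}) - \eta_\varepsilon(\mathbf{z})|\, \mu(d\mathbf{z}) \Big].$$

As $\eta_\varepsilon$ is uniformly continuous, there exists a number $\delta = \delta(\varepsilon) > 0$ such that 
if $\mathrm{diam}(A) \le \delta$, then $|\eta_\varepsilon(\mathbf{x}) - \eta_\varepsilon(\mathbf{z})| \le \varepsilon$ for every $\mathbf{x}, \mathbf{z} \in A$. In addition, there 
is a positive constant $M$ such that $|\eta_\varepsilon(\mathbf{x})| \le M$ for every $\mathbf{x} \in \mathbb{R}^d$. Thus, 

$$\mathrm{II} \le \varepsilon + 2M\, \mathbb{P}\{\mathrm{diam}(A_{k_n}(\mathbf{X})) > \delta\}.$$

Therefore, $\mathrm{II} \le 2\varepsilon$ for all $n$ large enough by Proposition 5.1. Finally, 
$\mathrm{III} \le \mathrm{I} \le \varepsilon$. Taken together, these steps prove the first statement of the 
proposition. 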

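Next, suppose assertion (ii) is false and set, to simplify notation, $\delta = 1 - d\alpha > 0$. 
Then we can find a subsequence $\{k^*_{n_i}\}_{i \ge 1}$ of $\{k^*_n\}_{n \ge 1}$ and a positive constant 
$C$ such that, for all $i$, 

$$\frac{2^{dk^*_{n_i}}}{n_i} \ge C.$$

Since $n_i \to \infty$, it can be assumed, without loss of generality, that $n_i \ge 2$ and 
$\log_2(C n_i) \ge 2d$ for all $i$. This implies in particular 

$$k^*_{n_i} \ge \frac{\log_2(C n_i)}{d} \ge \frac{\log_2(C n_i)}{2d} + 1 \qquad (5.1)$$

and $k^*_{n_i} \ge 2$ as well. 

On the one hand, by the very definition of $k^*_{n_i}$, 

$$\psi(n_i, k^*_{n_i} - 1) \ge \sqrt{\Big(\frac{2^{d(k^*_{n_i} - 1)}}{n_i}\Big)^{\delta}} \ge \sqrt{\Big(\frac{C}{2^d}\Big)^{\delta}}. \qquad (5.2)$$

On the other hand, by (5.1) and the monotonicity of $\psi(n_i, \cdot)$ (Fact 5.1(i)), 
we may write 

$$\psi(n_i, k^*_{n_i} - 1) \le \psi\Big(n_i, \Big\lfloor \frac{\log_2(C n_i)}{2d} \Big\rfloor\Big).$$

But, setting 

$$t_{n_i} = \Big\lfloor \frac{\log_2(C n_i)}{2d} \Big\rfloor,$$

we have 

$$\frac{t_{n_i} 2^{d t_{n_i}}}{n_i} \le \frac{\log_2(C n_i)}{2d} \sqrt{\frac{C}{n_i}}.$$

This quantity goes to 0 as $n_i \to \infty$. Moreover, $t_{n_i} \to \infty$ and thus, according 
to the first statement of the proposition, 

$$\psi(n_i, k^*_{n_i} - 1) \to 0 \quad \text{as } n_i \to \infty.$$

This contradicts (5.2). ■ 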



5.2 Proof of the theorem 

Let $\{k^*_n\}_{n \ge 1}$ be defined as in Proposition 5.2. We denote by $\mathcal{G}_n$ the leaf 
regions of the cellular tree, and by (respectively, Q^*) the collection of 
leaves at level at most (respectively, strictly at least) A;*. Finally, for any cell 

A, we set 

$$L_n(A) = \mathbb{P}\{g_n(\mathbf{X}) \ne Y, \mathbf{X} \in A \mid \mathcal{D}_n\}.$$
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With this notation, we have 



L* < EL(g n ) = E 


E L ^ A ) 














= E 


E L n {A) 


+ E 


E 




K n 







Set 

Then, clearly, 

E e 



< E 



< E 



»iV(A) + l 



E ^) 



""n. 



= P { \l n {A k *(X)) - L„(4k*( x )> fe+)| > ^ (A fe *(X)) } . 
In the second inequality, we used the definition of the stopping rule of the 
cellular tree. Therefore, according to technical Lemma 6.5, 
E 



E 



< O 



2<Jfc* 



n 



l-da-2/3 



Since $1 - d\alpha - 2\beta > 0$, this term tends to 0 as $n \to \infty$ by the second statement 
of Proposition 5.2. Next, introduce the notation 

n n 

N o( A ) = E 1 [x i e^,y i =o] and JVi(A) = E lpEieA,y<=i]> 



i=i 



i=l 



and observe that 
E 



E 



E 



E 1 1 ^o(A)>w 1 (^)] / r/(z)/i(dz) 
+ l[JV (A)<iV 1 (A)] / (l-77(z))/i(dz) 
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For x falling in the region covered by Q, * , denote by A,* (x) the cell of Q h * 

K n K n K n 

containing x, and set 



tf(^(*)) = £i[ 



X i6 A-,(x)]- 



i=l 



Letting 



we may write 



1 n 
' UX> = NiA-^h 1 ^. 



(x),r s =i]' 



E 



< E 



£ MA) 

E ^M^) 

+ E j 1 ^* 

AeSZ* 

+ £ | 1 [JV (A)<Ar 1 (A)] uf(l -r;(z))//(dz) - ^ (1 - 7) n (z)) //(dz] 



It follows, invoking Lemma 6.6, that 



E 





< E 


E ^n(A)M(A) 









n 



The rightmost term tends to 0 according to the second statement of 
Proposition 5.2. 

Thus, to complete the proof, it remains to establish that 

E L n (A)fi(A) 



E 



— > L as n — > oo. 
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To this aim, observe that by the very definition of G k * , we have 



E 



E **( A M A ) 



< E 



E 



E {Ln{A,k + ) + v {A))^A) 



+ E 



E ^M^) 



Ate: 



dcf 



I + 11. 



For every cell A of * , one has 



(5.3) 



Therefore, taking $n$ so large that $n/2^{dk^*_n} > 2$ (this is possible by Proposition 
5.2(ii)), we obtain 



II < 



n 
2 d K 



1 E 



E 



< 



n 

2 dk n 



Applying Proposition 5.2(ii) again, we conclude that $\mathrm{II} \to 0$ as $n \to \infty$. 
Next, define 



h 



odog 2 



n 



1 



and 



cdog 2 



n 

2*n 



+ 1 



Inequality (15. 3p implies that for every A e C/ fc * and all n large enough, 



Thus, by Fact \5AU ii) 



I < E 



E 



E L n (A,k n )/i(A) 
E L n (A, k n ) n{A) 



+ E 



+ o 



ndk', 

E 



Aeg; 



N(A)' 



2<ik* 



11 



l-da s 
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On the other hand, 



E 



L n {A,k n )n(A) 



< E 



E 



L*(A,k n )v(A) 



Aeg: 



L*(A,k n )fi(A) 



Aeg: 



E 



+ o 



\L n (A,k n )-L*(A,k n ) n(A) 



Aeg: 



2 dk n 



n 



l-da 



(by Lemma |6 
Consequently, 



I < E 



L*(A,k n )n(A) 



Aeg: 



+ o 



2 dk r, 



n 



l-da 



and the rightmost term tends to 0 as $n \to \infty$ by Proposition 5.2(ii). Thus, 
the proof will be finalized if we show that 



E 



L*(A,k n )»(A) 



Aeg: 



— >• L* as n — > oo. 



We have 



E 



L%A,k n )v(A) 



Aeg: 



= E 

< E 
= L 



E E 

Aec- A 3 ev hn {A) 



V(A) 



Aev kn 



where, in the inequality, we use the fact that the cells in the double sum are 
at level at least k n . But, clearly, 



k n 2 dkn a log 



< 



n 



n 



n 



l-da ' 
and consequently, since $d\alpha < 1$, 

$$\frac{k_n 2^{dk_n}}{n} \to 0 \quad \text{as } n \to \infty.$$

Thus, by Proposition 5.2(i), the term $L^*_{k_n}$ tends to $L^*$. This concludes the 
proof. 

6 Some technical results 

Throughout this section, we adopt the general notation of the document. In 
particular, we let $\alpha$ and $\beta$ be two positive real numbers such that $1 - d\alpha - 2\beta > 0$. 
The sequence $\{k^*_n\}_{n \ge 1}$ is defined as in Proposition 5.2 and we set 

$$k^+ = \lfloor \alpha \log_2(N(A) + 1) \rfloor. \qquad (6.1)$$

We will repeatedly use the fact that, by Proposition 5.2(ii), $2^{dk^*_n}/n \to 0$ as 
$n \to \infty$. For any $k \ge 0$, $\mathcal{T}_k$ stands for the full $2^d$-ary median-type tree with 
$k$ levels of nodes, whose leaves represent $\mathcal{P}_k$. 

Recall that $\mathbf{X}$ has probability measure $\mu$ on $\mathbb{R}^d$ and that its marginals are 
assumed to be nonatomic. The first important result that is needed here is 
the following one. 

Proposition 6.1 Let $\{k_n\}_{n \ge 1}$ be a sequence of nonnegative integers such 
that $2^{dk_n}/n \to 0$. Then 

$$\mathbb{E}\Big[\sum_{A \in \mathcal{P}_{k_n}} \Big| \frac{N(A)}{n} - \mu(A) \Big|\Big] = O\Big(\sqrt{\frac{2^{dk_n}}{n}}\Big).$$

Proof of Proposition 6.1 In the sequel, we let $n$ be large enough to 
ensure that $n/2^{dk_n} > 2$, so that we do not have to worry about empty cells. 

To prove the proposition, recall the construction of $\mathcal{T}_{k_n}$. At the root, which 
represents $\mathbb{R}^d$, we order the points by the first component. We define the pivot 
as the $r$-th smallest point, where $r = \lfloor (n+1)/2 \rfloor$, and cut perpendicularly to 
the first component at the pivot. Let the pivot's first component have value 
$x^*$. Define 

$$A = \{\mathbf{x} \in \mathbb{R}^d : \mathbf{x} = (x_1, \ldots, x_d),\ x_1 < x^*\}$$

and 

$$B = \{\mathbf{x} \in \mathbb{R}^d : \mathbf{x} = (x_1, \ldots, x_d),\ x_1 > x^*\}.$$
The sample points that fall in $A$, conditionally on the pivot, are distributed 
according to $\mu$ restricted to $A$, and similarly for $B$. Also, importantly, 
in distribution, 

$$\mu(A) \stackrel{\mathcal{L}}{=} \mathrm{Beta}(r, n - r + 1)$$

and 

$$\mu(B) \stackrel{\mathcal{L}}{=} \mathrm{Beta}(n - r + 1, r),$$

from the theory of order statistics (see, e.g., David and Nagaraja). 
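The Beta law of $\mu(A)$ can be checked empirically. The Python sketch below is a one-dimensional illustration under the assumption that $\mu$ is uniform on $[0, 1]$, so that $\mu(A)$ is simply the value of the pivot; the function name `simulate_left_mass` and the parameter values are hypothetical choices. It compares the sample mean and variance with those of $\mathrm{Beta}(r, n - r + 1)$, namely $r/(n+1)$ and $r(n-r+1)/((n+1)^2(n+2))$.

```python
import random


def simulate_left_mass(n: int, trials: int, seed: int = 0):
    """Simulate mu(A) for uniform data on [0, 1]: the r-th order statistic."""
    rng = random.Random(seed)
    r = (n + 1) // 2  # pivot rank, r = floor((n + 1) / 2)
    masses = []
    for _ in range(trials):
        sample = sorted(rng.random() for _ in range(n))
        masses.append(sample[r - 1])  # mu(A) = P{X < pivot} = pivot value
    return r, masses


n = 21
r, masses = simulate_left_mass(n, trials=20000, seed=1)
mean = sum(masses) / len(masses)
var = sum((m - mean) ** 2 for m in masses) / len(masses)
print("empirical mean:", mean, "Beta mean:", r / (n + 1))
print("empirical var: ", var, "Beta var: ", r * (n - r + 1) / ((n + 1) ** 2 * (n + 2)))
```

Restricting to the uniform case loses no generality for this check, since the construction is invariant under monotone transformations of the axes.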

We need to see how large $\mu(A)$, $\mu(B)$, $N(A)$ and $N(B)$ are. To this aim, we 
distinguish between the cases where $n$ is odd and $n$ is even. 

1. $n$ odd. Now $r = (n+1)/2$, $N(A) = r - 1 = (n-1)/2$, $N(B) = 
n - r = (n-1)/2$, and both $\mu(A)$ and $\mu(B)$ are distributed as 
$\mathrm{Beta}((n+1)/2, (n+1)/2)$. 

2. $n$ even. In this case we have $r = n/2$, $N(A) = (n-2)/2$, $N(B) = n/2$, 
$\mu(A)$ is distributed as $\mathrm{Beta}(n/2, n/2 + 1)$ and $\mu(B)$ as $\mathrm{Beta}(n/2 + 1, n/2)$. 

As $N(A) + N(B) = n - 1$, the pivot is not sent down to the subtrees. Let us 
have a canonical way of deciding who goes left and right, e.g., $A$ is left and 
$B$ is right. Next, still at the root, we rotate the coordinate and repeat the 
median splitting process for the sample points in $A$ and $B$ (both open sets) 
in direction 2, then in direction 3, and so forth until direction $d$. We create 
this way the $2^d$ children of the root and, repeating this scheme for $k_n$ levels 
of nodes, we construct the $2^d$-ary tree up to distance $k_n$ from the root. It has 
exactly $2^{dk_n}$ leaves. 

On any path of length $k_n$ to one of the $2^{dk_n}$ leaves, we have a deterministic 
sequence of cardinalities 

$$n_0 = n\ (\text{root}),\ n_1, n_2, \ldots, n_{k_n}.$$

We have already seen that, for all $i = 0, \ldots, k_n$, 

$$\frac{n}{2^{di}} - 2 \le n_i \le \frac{n}{2^{di}}.$$
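The deterministic cardinalities can be tabulated directly. The following Python sketch (with hypothetical helper names `split_sizes` and `leaf_sizes`) applies the pivot rule $r = \lfloor (m+1)/2 \rfloor$ to a cell with $m$ points, removes the pivot, and repeats. Here `depth` counts binary cuts, so one level of the $2^d$-ary tree corresponds to $d$ binary cuts; treating a cell with no points as producing two empty children is a convention assumed for this illustration.

```python
from typing import List, Tuple


def split_sizes(m: int) -> Tuple[int, int]:
    """Sizes of the two children when a cell holding m points is cut at its pivot."""
    if m == 0:
        return (0, 0)  # nothing to split (convention of this sketch)
    r = (m + 1) // 2   # rank of the pivot
    return (r - 1, m - r)  # the pivot itself is not passed down


def leaf_sizes(n: int, depth: int) -> List[int]:
    """Cardinalities of all cells after `depth` successive binary median cuts."""
    sizes = [n]
    for _ in range(depth):
        sizes = [child for m in sizes for child in split_sizes(m)]
    return sizes


# d = 2 and k = 2 levels of the 4-ary tree, i.e. 4 binary cuts on n = 1000 points.
n, d, k = 1000, 2, 2
sizes = leaf_sizes(n, d * k)
print(sorted(set(sizes)), sum(sizes))
```

For these values every leaf holds 61 or 62 points, within the stated bounds around $n/2^{dk} = 62.5$, and the 15 pivots removed along the way account for the gap between the total of 985 and $n = 1000$.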
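Now, consider a fixed path to a fixed leaf, $(n_0, n_1, \ldots, n_{k_n})$. Then, 
conditionally on the pivots, the set of $\mathbb{R}^d$ that corresponds to that leaf, i.e., a 
hyperrectangle of $\mathbb{R}^d$, has $\mu$-measure distributed as 

$$\mathrm{Beta}(n_1 + 1, n - n_1) \times \cdots \times \mathrm{Beta}(n_{k_n} + 1, n_{k_n - 1} - n_{k_n}) \stackrel{\text{def}}{=} Z_1 \times \cdots \times Z_{k_n} \stackrel{\text{def}}{=} Z.$$

Observe that 

$$\mathbb{E}Z = \prod_{i=1}^{k_n} \mathbb{E}Z_i = \prod_{i=1}^{k_n} \frac{n_i + 1}{n_{i-1} + 1} = \frac{n_{k_n} + 1}{n + 1}.$$

Also, 

$$\mathbb{E}Z^2 = \prod_{i=1}^{k_n} \mathbb{E}Z_i^2 = \prod_{i=1}^{k_n} \frac{(n_i + 1)(n_i + 2)}{(n_{i-1} + 1)(n_{i-1} + 2)} = \frac{(n_{k_n} + 1)(n_{k_n} + 2)}{(n + 1)(n + 2)}.$$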
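The telescoping of the two moments can be verified in exact rational arithmetic. The Python sketch below uses the standard moments of a $\mathrm{Beta}(a, b)$ variable, $a/(a+b)$ and $a(a+1)/((a+b)(a+b+1))$, and multiplies them along a path of cardinalities, treating the factors as independent. The helper names and the sample path are hypothetical.

```python
from fractions import Fraction
from typing import Sequence, Tuple


def beta_moments(a: int, b: int) -> Tuple[Fraction, Fraction]:
    """First and second moments of a Beta(a, b) random variable."""
    mean = Fraction(a, a + b)
    second = Fraction(a * (a + 1), (a + b) * (a + b + 1))
    return mean, second


def product_moments(path: Sequence[int]) -> Tuple[Fraction, Fraction]:
    """Moments of Z = Z_1 * ... * Z_k with Z_i ~ Beta(n_i + 1, n_{i-1} - n_i)."""
    mean, second = Fraction(1), Fraction(1)
    for prev, cur in zip(path, path[1:]):
        m, s = beta_moments(cur + 1, prev - cur)
        mean *= m
        second *= s
    return mean, second


path = [1000, 499, 249, 124, 61]  # n_0 = n, then successive cardinalities
n, n_last = path[0], path[-1]
mean, second = product_moments(path)
print(mean == Fraction(n_last + 1, n + 1))
print(second == Fraction((n_last + 1) * (n_last + 2), (n + 1) * (n + 2)))
```

Only the first and last cardinalities survive in the products, which is what makes the variance of $Z$ easy to control in the next step.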
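The objective is to bound 

$$\mathbb{E}\Big| \frac{n_{k_n}}{n} - Z \Big| \le \sqrt{\mathbb{E}\Big| Z - \frac{n_{k_n}}{n} \Big|^2} = \sqrt{\mathbb{E}|Z - \mathbb{E}Z|^2 + \Big| \frac{n_{k_n}}{n} - \mathbb{E}Z \Big|^2}$$
$$= \sqrt{\mathbb{V}Z + \Big| \frac{n_{k_n}}{n} - \mathbb{E}Z \Big|^2} = \sqrt{\mathbb{V}Z + \Big| \frac{n_{k_n}}{n} - \frac{n_{k_n} + 1}{n + 1} \Big|^2},$$

where the symbol $\mathbb{V}$ stands for the variance. Note 

$$\Big| \frac{n_{k_n}}{n} - \frac{n_{k_n} + 1}{n + 1} \Big| = \frac{|n_{k_n} - n|}{n(n + 1)} \le \frac{1}{n + 1}.$$

Also, 

$$\mathbb{V}Z = \Big( \frac{n_{k_n} + 1}{n + 1} \Big) \Big( \frac{n_{k_n} + 2}{n + 2} - \frac{n_{k_n} + 1}{n + 1} \Big) = \Big( \frac{n_{k_n} + 1}{n + 1} \Big) \times \frac{n - n_{k_n}}{(n + 2)(n + 1)} \le \frac{n_{k_n} + 1}{(n + 1)(n + 2)}.$$

Thus, 

$$\mathbb{E}\Big| \frac{n_{k_n}}{n} - Z \Big| \le \sqrt{\frac{n_{k_n} + 1}{(n + 1)(n + 2)} + \frac{1}{(n + 1)^2}} \le \frac{\sqrt{n_{k_n} + 2}}{n + 1}.$$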



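Sum over all $2^{dk_n}$ sets in the partition $\mathcal{P}_{k_n}$, and call the set cardinalities 
$n_{k_n}(1), \ldots, n_{k_n}(2^{dk_n})$. Then, denoting by $Z_i$ the "$Z$" for the $i$-th set in the 
partition, we obtain 

$$\mathbb{E}\Big[\sum_{i=1}^{2^{dk_n}} \Big| \frac{n_{k_n}(i)}{n} - Z_i \Big|\Big] \le \frac{1}{n + 1} \sum_{i=1}^{2^{dk_n}} \sqrt{n_{k_n}(i) + 2} \le \frac{1}{n + 1} \sqrt{2^{dk_n}} \sqrt{\sum_{i=1}^{2^{dk_n}} (n_{k_n}(i) + 2)}$$

(by the Cauchy-Schwarz inequality). 

Therefore, 

$$\mathbb{E}\Big[\sum_{i=1}^{2^{dk_n}} \Big| \frac{n_{k_n}(i)}{n} - Z_i \Big|\Big] \le \frac{\sqrt{2^{dk_n}}}{n + 1} \sqrt{n + 2^{dk_n + 1}} \le \frac{\sqrt{2^{dk_n}}}{n + 1} \Big( \sqrt{n} + \sqrt{2^{dk_n + 1}} \Big) \le \sqrt{\frac{2^{dk_n}}{n}} + \frac{\sqrt{2}\, 2^{dk_n}}{n}.$$

Since $2^{dk_n}/n \to 0$ as $n \to \infty$, this last term is $O(\sqrt{2^{dk_n}/n})$. ■ 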

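Corollary 6.1 Let $\{k_n\}_{n \ge 1}$ be a sequence of nonnegative integers such that 
$2^{dk_n}/n \to 0$, and let $\mathcal{P}^-_{k_n}$ be the partition of $\mathbb{R}^d$ corresponding to the leaves of 
any subtree of $\mathcal{T}_{k_n}$ rooted at $\mathbb{R}^d$. Then 

$$\mathbb{E}\Big[\sum_{A \in \mathcal{P}^-_{k_n}} \Big| \frac{N(A)}{n} - \mu(A) \Big|\Big] = O\Big(\sqrt{\frac{2^{dk_n}}{n}}\Big).$$

Proof of Corollary 6.1 The proof is similar to the proof of Proposition 
6.1: just note that $\mathcal{P}^-_{k_n}$ has at most $2^{dk_n}$ cells. ■ 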
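Proposition 6.2 Let $\{k_n\}_{n \ge 1}$ be a sequence of nonnegative integers such 
that $2^{dk_n}/n \to 0$. Then 

$$\mathbb{E}\Big| \frac{1}{N(A_{k_n}(\mathbf{X}))} \sum_{i=1}^n \mathbf{1}_{[\mathbf{X}_i \in A_{k_n}(\mathbf{X}), Y_i = 1]} - \frac{1}{\mu(A_{k_n}(\mathbf{X}))} \int_{A_{k_n}(\mathbf{X})} \eta(\mathbf{z})\, \mu(d\mathbf{z}) \Big| = O\Big(\sqrt{\frac{2^{dk_n}}{n}}\Big)$$

and, similarly, 

$$\mathbb{E}\Big| \frac{1}{N(A_{k_n}(\mathbf{X}))} \sum_{i=1}^n \mathbf{1}_{[\mathbf{X}_i \in A_{k_n}(\mathbf{X}), Y_i = 0]} - \frac{1}{\mu(A_{k_n}(\mathbf{X}))} \int_{A_{k_n}(\mathbf{X})} (1 - \eta(\mathbf{z}))\, \mu(d\mathbf{z}) \Big| = O\Big(\sqrt{\frac{2^{dk_n}}{n}}\Big).$$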



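Proof of Proposition 6.2 We only prove the first statement. Since 
$n/2^{dk_n} \to \infty$ as $n \to \infty$, we can always choose $n$ large enough so that no 
cell of $\mathcal{P}_{k_n}$ is empty. A quick check of $\mathcal{T}_{k_n}$ reveals that, given the pivots (see 
Proposition 6.1), the points inside each cell are distributed in an i.i.d. manner 
according to the restriction of $\mu$ to the cell. Moreover, conditionally on $\mathbf{X}$ 
and the pivots, $N(A_{k_n}(\mathbf{X}))$ has a deterministic, fixed value. Thus, setting 

$$\bar{\eta}_n(\mathbf{X}) = \frac{1}{\mu(A_{k_n}(\mathbf{X}))} \int_{A_{k_n}(\mathbf{X})} \eta(\mathbf{z})\, \mu(d\mathbf{z}),$$

we obtain, conditionally on $\mathbf{X}$ and the pivots, 

$$\mathbb{E}\Big| \frac{1}{N(A_{k_n}(\mathbf{X}))} \sum_{i=1}^n \mathbf{1}_{[\mathbf{X}_i \in A_{k_n}(\mathbf{X}), Y_i = 1]} - \frac{1}{\mu(A_{k_n}(\mathbf{X}))} \int_{A_{k_n}(\mathbf{X})} \eta(\mathbf{z})\, \mu(d\mathbf{z}) \Big| \le \sqrt{\frac{\bar{\eta}_n(\mathbf{X})(1 - \bar{\eta}_n(\mathbf{X}))}{N(A_{k_n}(\mathbf{X}))}}$$
$$\le \frac{1}{2\sqrt{N(A_{k_n}(\mathbf{X}))}} \le \frac{1}{2} \sqrt{\frac{1}{n/2^{dk_n} - 2}}.$$

The result follows from the condition $2^{dk_n}/n \to 0$. ■ 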
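The two inequalities used in this proof are the usual "mean absolute deviation is at most the standard deviation" bound for an average of $N$ i.i.d. Bernoulli($p$) variables, followed by $p(1-p) \le 1/4$. As a sanity check, the Python sketch below computes the mean absolute deviation exactly from the binomial probabilities; the function name `mean_abs_deviation` and the values of `n_points` and `p` are hypothetical.

```python
from math import comb, sqrt


def mean_abs_deviation(n_points: int, p: float) -> float:
    """Exact E|S/N - p| where S ~ Binomial(N, p), obtained by summing over the pmf."""
    total = 0.0
    for k in range(n_points + 1):
        prob = comb(n_points, k) * p ** k * (1 - p) ** (n_points - k)
        total += prob * abs(k / n_points - p)
    return total


n_points, p = 50, 0.3
mad = mean_abs_deviation(n_points, p)
std_bound = sqrt(p * (1 - p) / n_points)  # sqrt(eta_bar * (1 - eta_bar) / N)
crude_bound = 1 / (2 * sqrt(n_points))    # 1 / (2 * sqrt(N))
print(mad, std_bound, crude_bound)
```

In the proposition the role of $p$ is played by $\bar{\eta}_n(\mathbf{X})$ and that of $N$ by $N(A_{k_n}(\mathbf{X}))$, which is deterministic once $\mathbf{X}$ and the pivots are fixed.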
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Corollary 6.2 Let {k n } n >i be a sequence of nonnegative integers such that 
2 dkn /n — > , and let be the 'partition ofM. d corresponding to the leaves 
of any subtree of T kn rooted at M. d . For each x e R , denote by Aj^ (x) the 
cellofV kn containing x. Then 



E 



= o 



ij(z)/i(dz) 



and, similarly, 
E 



(x),y i= o] 



^(^( X ))-/a L( x 



1 -?7(z))/i(dz) 



2*n 



n 



Proof of Corollary 16.21 The proof is similar to that of Proposition 16.2 
just note that 

n 

for all n large enough. ■ 

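Lemma 6.1 Let $\{k_n\}_{n \ge 1}$ be a sequence of nonnegative integers such that 
$2^{dk_n}/n \to 0$. Then 

$$\mathbb{E}\big| \hat{L}_n(A_{k_n}(\mathbf{X})) - L^*(A_{k_n}(\mathbf{X})) \big| = O\Big(\sqrt{\frac{2^{dk_n}}{n}}\Big).$$

Proof of Lemma 6.1 Using the definition of $\hat{L}_n(A_{k_n}(\mathbf{X}))$ and $L^*(A_{k_n}(\mathbf{X}))$, 
we may write 

$$\mathbb{E}\big| \hat{L}_n(A_{k_n}(\mathbf{X})) - L^*(A_{k_n}(\mathbf{X})) \big| \le \mathbb{E}\Big| \frac{1}{N(A_{k_n}(\mathbf{X}))} \sum_{i=1}^n \mathbf{1}_{[\mathbf{X}_i \in A_{k_n}(\mathbf{X}), Y_i = 1]} - \frac{1}{\mu(A_{k_n}(\mathbf{X}))} \int_{A_{k_n}(\mathbf{X})} \eta(\mathbf{z})\, \mu(d\mathbf{z}) \Big|$$
$$+\ \mathbb{E}\Big| \frac{1}{N(A_{k_n}(\mathbf{X}))} \sum_{i=1}^n \mathbf{1}_{[\mathbf{X}_i \in A_{k_n}(\mathbf{X}), Y_i = 0]} - \frac{1}{\mu(A_{k_n}(\mathbf{X}))} \int_{A_{k_n}(\mathbf{X})} (1 - \eta(\mathbf{z}))\, \mu(d\mathbf{z}) \Big|.$$

Each term of the sum is $O(\sqrt{2^{dk_n}/n})$ by Proposition 6.2. ■ 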
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Lemma 6.2 Let {k n } n >i be a sequence of nonnegative integers such that 
2 dkn /n — > 0, and let V k be the partition ofW d corresponding to the leaves of 
any subtree of Tk„ rooted at M. d . For each x e M. d , denote by A k (x) the cell 
ofV k containing x. Then 



E 



L n (A kn (X))-L* (A kn (X)) 



O 



n 



Proof of Lemma 6.2 The proof is similar to that of Lemma 6.1. It uses 
Corollary 6.2 instead of Proposition 6.2. ■ 



Lemma 6.3 Let 

Then 



alog 2 



n 



E 



L n {A k *(X),k n ) - L*(A k *(X),k r , 



O 



2dk* 



71 



Proof of Lemma 6.3 We have 

E L n (A K (X),k n ) - L*(A k *(X),k n 

= E 



< E 



E E 
E E 

A ^k* A,eV kn (A) 



L n (A,] 

L n (A 3 : 



N(Aj) 
N(A) 


~ L\A 3 ) 


KA) 


N{A 3 ) 
N(A) 


- L n (Aj 


MAj) 
\{A) 



V(A) 
»(A) 



+ E 



E |^(^-)-^(^)|m^ 



dcf 



I + 11. 



II = E 



Clearly, 

whence, according to Lemma 6.1, 

II = O 



L n {A K+kn (X)) - L* {A K+kn (X)) 



2 dk r, 



n 



l-da 
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On the other hand, since L n (Aj) < 1, 



I < E 



< E 



E E 

A ^k* A 3 eV kn (A) 

E E 

A 3 ^V kn (A) 



N(Aj) ^Aj) 



N(A) pi(A) 



N{A 3 ), A . N{A 3 ) 



+ E 



E E 



AT(A) 



n 



n 



The inequality 



N(A 3 )<N(A) 

A 3 ev kn (A) 



leads to 



I < E 



E 



E 

AeV k * 

E 

AG-Pfe* 
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»(A) 



N(A) 



n 



N(A) 



n 



+ E 



E 



E E 
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N(A 3 



n 
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AeV, 
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n 



Thus, by Proposition 6.1, 



I = O 



2dk* 



n 



l-da 



Collecting bounds, we obtain 



1 + 11 = 



2 dk n 



n 



l-da 



Lemma 6.4 Let Q hi< be the collection of cells of Q n at level at most k*, and 
let 



k,. 



a log 2 I max 



n 



2 dk n 1 1 ' 1 
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Then 



E 



\L n (A,k n )-L*(A,k n ) 11(A) 



O 



2 dk n 



n 



l-da 



Proof of Lemma 16.41 Denote by Gu* the cells of Vk* such that the path 
from the root to the cell does not cross QZ*- By construction, the subset 
collection 



= g„ u g k + 

h n h n "'n 



is a partition of M. d represented by a subtree of 71* rooted at M. d . Moreover, 
clearly, 



E 



\L n {A,k n )-L*{A,k n ) fi(A) 



Aeg- 



< E 



J2 \L n {A,k n )-V(A,k n ) fi(A) 
Thus, denoting by A~^* (x) the cell of VjZ, containing x, we are led to 



E 



\L n (A,k n )-L*(A,k n ) 11(A) 



Aeg: 



<EL„(4. (X) , k n ) — L* (A' (X) , k r , 



The end of the proof is similar to the proof of Lemma 6.3. Replace $\mathcal{P}_{k^*_n}$ by 
the subtree partition defined above and invoke Corollary 6.1 (instead of 
Proposition 6.1) and Lemma 6.2 (instead of Lemma 6.1). ■ 

Proposition 6.3 Let k + be defined as in \6. Then 



E 



L n (A K (X)) - L n (A K (K), k+) < ij(n, k* n ) + O 



n 



l-da 



where 



1>(n,k) = L%-L* 
Proof of Proposition 6.3 For every cell $A$ of $\mathcal{P}_{k^*_n}$, one has 

{ ft \ 71 



Define 



k' 



alog 2 



n 



- 1 



and k r 



a\og 2 



n 

2<ik* 



+ 1 



32 \ 2 dk 

and note that, by inequalities (6.2), for all $n$ large enough, 

k n ^ k ^ kyi . 

Thus, by the triangle inequality and Fact 5.1(ii), we may write 



E 



L n (A k *(X)) - L n (A k *(X),k+) 



< E 



L n (A k *(X)) -L n (A K (X),k + ) + 
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E 
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E 
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n 



Consequently, by Fact I5.1( m). 



E 



L n (A k .(X)) -L n (A k *(X),V 



< E 



L n (A K {X)) - L n {A K {X),k n ) 



+ o 



n 



With respect to the first term on the right-hand side, we have 



E 



L n (A k * n (X))-L n (A k * n (X),k n ) 
<e|l„ (A K {X))-L* {A k (X)) 

+ L t* - L * 

+ L*-E[L*{A k *{X),k n )] 

+ E L*(A k JX),k n ) - L n (A K (X),k n ) 
According to Lemma 6.1, the first of the four terms above is $O(\sqrt{2^{dk^*_n}/n})$, 
whereas the third one is nonpositive by Fact 5.1(iv). Consequently, 



E 



L n (Afc*(X)) - L n (A fc *(X), k n ) < </>(n, k*) + O 



2 dk n 



n 



+ E 



L*(A K {X),k n ) - L n {A K (X),k n ) 



Invoking finally Lemma 6.3, we see that 



E 



L n (A K (X)) - L n (A K (X), k n ) < ^(n, k*) + O 
Combining this result with (6.3) leads to the desired statement. ■ 

Lemma 6.5 Let $k^+$ be defined as in (6.1). Then 



P 



L n (A K {X)) -L n (A K (X),k + ) 



> 



N(A k *(X))+l 



O 



2 dk n 



n 



l-da-2/3 



Proof of Lemma 16.51 Set 

Since N(A k *(x)) < n/2 dk ™, one has 



N(A) + 1 



P 



L n (A k *jX))-L n (A k *jX),k+)\><p(A k *(X))} 
L n (A k *(X)) -L n (A k * n (X),k+) 



< P 



1 



n/2 dk n + 1 



Therefore, by Markov's inequality, 

p{|L n {A k *(X)) - L n (A k *(X), k+)\ > if (A k *(X 
< (n/2 dk n + lfxE L n (A K {X)) - L n (A k *(X), k~ 



36 



Thus, by Proposition 

P 



L n (A K (X))-L n (A K (X),k + )\ >^(A fc *(X))} 
< {n/2 dk - + if x 



2<lk* 



But, by definition of k* 
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It follows, since n/2 n — > oo, that 



P- 



[\L n (Afc*(X)) - L n (A K (X), k + )\ > V (A)} = O 
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Lemma 6.6 Let Q h + be the collection of cells of Q n at level at most k* For 
x G M. d , denote by A^*(x) the cell of Q^* containing x ; and set N(A^(x)) = 



1 n 

=737 EWi* 



(x),y,=i]- 



Then 
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r) n (z)/i(dz) - / r/(z)/i(dz) 
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and, similarly, 
Proof of Lemma 6.6 We only have to prove the first statement. To this 
aim, observe that 



E 
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r) n {z)fi(dz) - / r?(z)/i(dz) 

A J A 



E 



E 



1 x ^ 



fi(A) 
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Denote by Q ki , the cells of Vk* such that the path from the root to the cell 
does not cross . By construction, the subset collection 



"'n "'n 



is a partition of lR d represented by a subtree of Tk* rooted at M. d . Now, 
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But, since P fc * is a partition of M d , one has 



E 



E 



n 

^E* 



E 



i 



[Xj eA,y i= i] 



fi(A) 



r?(z)/i(dz) 



tt ^ W ^ =1J m(^(x)) y^ (x) 

This term goes to 0 by Corollary 6.2. ■ 
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